[Dev][feat] Support CUDA Graph capture offloading modules by lhb8125 · Pull Request #3219 · NVIDIA/Megatron-LM

lhb8125 · 2026-02-03T04:36:38Z

What does this PR do ?

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

- I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
- I have added relevant unit tests
- I have added relevant functional tests
- I have added proper typing to my code (see the Typing guidelines)
- I have added relevant documentation
- I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers' reviews

Attach the Expert Review label when your PR is ready for review.
GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add Final Review label
GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.


lhb8125 · 2026-03-02T09:26:26Z

/ok to test 0200121

buptzyb · 2026-03-04T08:47:28Z

docs/api-guide/fine_grained_activation_offloading.md

+Fine-grained offloading is compatible with CUDA graphs. When CUDA graph is enabled, the following constraints apply:
+
+- `attn_norm` and `mlp_norm` **cannot** be offloaded (they cross CUDA graph boundaries).
+- `cuda_graph_scope` must include `attn` and `moe_router`.


Can I use the `moe` scope if I'm running a drop-pad MoE?

Can I offload attention modules if my CUDA graph scope is only `moe_router`? This may be needed because some cases have dynamically shaped attention, so only the router part can be captured.

I removed this hard limitation; the scope can now be `moe_router` alone or `moe`.

buptzyb · 2026-03-04T08:54:45Z

docs/api-guide/fine_grained_activation_offloading.md

+
+Fine-grained offloading is compatible with CUDA graphs. When CUDA graph is enabled, the following constraints apply:
+
+- `attn_norm` and `mlp_norm` **cannot** be offloaded (they cross CUDA graph boundaries).


Unless using the `moe` CUDA graph scope in a drop-pad or sync-free MoE.

What if we only capture `moe_router` or `moe_preprocess`? Is it still true?

I think so. If we only capture `moe_router`, `mlp_norm` serves as the input buffer of the graph, so it is not offloadable. The only exception is the `attn` + `moe` scope for a drop-pad MoE: `mlp_norm` is then entirely inside the graph, so it is offloadable.

By the way, you cannot capture `moe_preprocess` alone; `moe_preprocess` must go together with `moe_router`.
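
The rules discussed in this thread can be sketched as a small validation helper. This is only an illustration, not the code in this PR: the function name `check_offload_with_cuda_graph` and the `drop_pad_moe` flag are hypothetical, and the sketch assumes that `attn_norm` always crosses a graph boundary when any scope is captured, which the thread does not state explicitly.

```python
from typing import Iterable, List


def check_offload_with_cuda_graph(
    offload_modules: Iterable[str],
    cuda_graph_scope: Iterable[str],
    drop_pad_moe: bool = False,  # hypothetical flag: True for a drop-pad MoE
) -> List[str]:
    """Return a list of human-readable problems (empty if the combination looks fine)."""
    scope = set(cuda_graph_scope)
    modules = set(offload_modules)
    problems: List[str] = []

    # No CUDA graph scope: none of the constraints below apply.
    if not scope:
        return problems

    # moe_preprocess cannot be captured on its own.
    if "moe_preprocess" in scope and "moe_router" not in scope:
        problems.append("moe_preprocess must be captured together with moe_router")

    # Assumption: attn_norm is always a graph input, hence never offloadable.
    if "attn_norm" in modules:
        problems.append("attn_norm crosses a CUDA graph boundary and cannot be offloaded")

    # mlp_norm is fully inside the graph only for attn + moe on a drop-pad MoE.
    mlp_norm_inside_graph = drop_pad_moe and {"attn", "moe"} <= scope
    if "mlp_norm" in modules and not mlp_norm_inside_graph:
        problems.append("mlp_norm crosses a CUDA graph boundary and cannot be offloaded")

    return problems
```

For example, under these assumptions, offloading `mlp_norm` with a scope of only `moe_router` reports one problem, while the same module with the `attn` + `moe` scope on a drop-pad MoE reports none.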

megatron/core/transformer/transformer_layer.py


lhb8125 · 2026-03-05T09:06:42Z

/ok to test b481fa9


lhb8125 · 2026-03-05T13:21:16Z

/ok to test ce84682

buptzyb · 2026-03-05T13:37:58Z

docs/api-guide/fine_grained_activation_offloading.md

+3. **Apply fraction**: Only a fraction of eligible groups are actually offloaded (controlled by `activation_offload_fraction`).
+4. **Print summary table**: An ASCII table of per-rank offload bytes is printed for debugging.
+
+### CPU Tensor Pool


GPU Tensor Pool?

It is indeed a CPU tensor pool: it reuses CPU tensors from the pool to avoid calling `cudaMallocHost`, which CUDA graphs do not support.

GPU tensors are allocated and freed on demand from the PyTorch memory pool.
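
The reuse idea described above can be sketched in a few lines. This is a minimal illustration, not the pool implemented in this PR: the class name `CpuTensorPool`, its methods, and the injected `allocate` callable are all hypothetical. In a real setup the allocator would presumably create a pinned host tensor, for example with `torch.empty(shape, dtype=dtype, pin_memory=True)`; a plain `bytearray` stands in here so the sketch runs without a GPU.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

Key = Tuple[Tuple[int, ...], str]


class CpuTensorPool:
    """Hypothetical pool that hands back released host buffers instead of allocating new ones."""

    def __init__(self, allocate: Callable[[Tuple[int, ...], str], Any]):
        self._allocate = allocate  # called only when no free buffer matches
        self._free: Dict[Key, List[Any]] = defaultdict(list)
        self.num_allocations = 0  # counts real allocations, for inspection

    def acquire(self, shape: Tuple[int, ...], dtype: str) -> Any:
        key = (tuple(shape), dtype)
        if self._free[key]:
            # Reuse path: no new host allocation is issued.
            return self._free[key].pop()
        self.num_allocations += 1
        return self._allocate(key[0], dtype)

    def release(self, shape: Tuple[int, ...], dtype: str, buffer: Any) -> None:
        # Return the buffer to the pool so a later acquire() can reuse it.
        self._free[(tuple(shape), dtype)].append(buffer)


# Usage: the second acquire() of the same shape and dtype reuses the first buffer.
pool = CpuTensorPool(lambda shape, dtype: bytearray(shape[0] * shape[1] * 2))
first = pool.acquire((2, 4), "bf16")
pool.release((2, 4), "bf16", first)
second = pool.acquire((2, 4), "bf16")
```

Keying the free lists by shape and dtype keeps the sketch simple; a real pool might instead bucket by byte size.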

buptzyb · 2026-03-05T13:40:30Z

docs/api-guide/fine_grained_activation_offloading.md

+
+### Warmup and Adaptive Offloading
+
+The first training iteration serves as a **warmup phase** where the manager records tensor groups, their sizes, and the execution order. After warmup, a `post_warmup_callback` runs to:


So we cannot capture CUDA graphs on the first training iteration? If so, we should assert `cuda_graph_warmup_steps > 0` when offloading is enabled.

Yes, the assertion was added but removed by accident. Let me add it back.
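
A rough sketch of the kind of check being discussed, assuming a config object with these fields. Only `cuda_graph_warmup_steps` is named in this thread; the class `OffloadGraphConfig` and the two boolean flags are hypothetical stand-ins, and the actual assertion in the PR may live elsewhere and read differently.

```python
from dataclasses import dataclass


@dataclass
class OffloadGraphConfig:
    """Hypothetical config holder; only cuda_graph_warmup_steps appears in the thread."""

    offloading_enabled: bool = False  # hypothetical flag name
    cuda_graph_enabled: bool = False  # hypothetical flag name
    cuda_graph_warmup_steps: int = 0

    def validate(self) -> None:
        # The first iteration is the offloading warmup phase, so graph capture
        # must not happen before at least one warmup step has run.
        if self.offloading_enabled and self.cuda_graph_enabled:
            assert self.cuda_graph_warmup_steps > 0, (
                "cuda_graph_warmup_steps must be > 0 when activation offloading "
                "is combined with CUDA graphs"
            )
```

With this sketch, `OffloadGraphConfig(True, True, 0).validate()` raises an `AssertionError`, while a value of 1 or more passes.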

lhb8125 · 2026-03-06T01:29:37Z

/claude review

megatron/core/pipeline_parallel/fine_grained_activation_offload.py

megatron/core/transformer/multi_latent_attention.py

buptzyb · 2026-03-06T03:51:30Z

megatron/core/transformer/moe/experts.py


        # This is to avoid the CPU overhead of multiple d2h copies
-        if self.offload_expert_fc1:
+        if self.offload_expert_fc1 and not self.config.fp8:


Anything special about fp8?

This was to avoid multiple D2H copies, but it also doubles the number of bytes offloaded, so it's a tradeoff. Since we can now delay the offloading until after graph replay, we could disable `save_original_input` by default.
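
The "delay the offloading until after graph replay" idea can be pictured roughly as follows. This is a sketch under assumptions, not the PR's implementation: `flush_delayed_groups` is a name that appears in the commit messages, but the class `DelayedOffloader`, the `inside_graph` parameter, and the injected `copy_to_host` callable are hypothetical.

```python
from typing import Callable, List


class DelayedOffloader:
    """Hypothetical helper that defers D2H copies requested inside a CUDA graph."""

    def __init__(self, copy_to_host: Callable[[str], None]):
        self._copy_to_host = copy_to_host  # stand-in for the real D2H copy
        self._delayed: List[str] = []

    def offload(self, group: str, inside_graph: bool) -> None:
        if inside_graph:
            # Defer: the copy is issued only after the graph has finished.
            self._delayed.append(group)
        else:
            self._copy_to_host(group)

    def flush_delayed_groups(self) -> None:
        # Issue every deferred copy in request order, then clear the queue.
        pending, self._delayed = self._delayed, []
        for group in pending:
            self._copy_to_host(group)


# Usage: a group requested inside the graph is copied only at flush time.
copied: List[str] = []
offloader = DelayedOffloader(copied.append)
offloader.offload("expert_fc1", inside_graph=True)
copies_before_flush = list(copied)
offloader.flush_delayed_groups()
```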


lhb8125 · 2026-03-06T06:54:50Z

/ok to test a6e16a9

lhb8125 and others added 30 commits October 29, 2025 02:46

renaming golden values

1219a26

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

fix bug: accuracy issu because of recomputing and offloading same module

ce6e661

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

Merge branch 'dev' into hongbinl/activation_offloading_fix

d04d741

format

2fe4aeb

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

update golden values

fb3f7c3

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

Merge branch 'dev' into hongbinl/activation_offloading_fix

5001e2b

update golden values

9937890

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

update model_config and golden values

6c83118

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

format

33a38f5

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

update golden values

6c76b07

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

Merge branch 'dev' into hongbinl/activation_offloading_fix

4d83f69

temp save

8e72b44

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

support offloading+cuda graph

1646f04

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

Merge branch 'dev' into hongbinl/activation_offloading_cuda_graph

43973a7

support PP=1

a177cf5

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

support VPP

f7cfbba

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

bug fix

6d475ad

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

support VPP

089da6c

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

code refactor

35b0f97

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

big code refactor and format

df09b85

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

Merge branch 'dev' into hongbinl/activation_offloading_cuda_graph

06ef4e2

minor fix

12cb8de

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

minor fix

3cf19b7

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

dump offloading information

d0fc888

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

Merge branch 'dev' into hongbinl/activation_offloading_cuda_graph

bc47650

fix ut

b7c0fba

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

Merge branch 'hongbinl/activation_offloading_cuda_graph' of https://g…

b18e69b

…ithub.com/lhb8125/Megatron-LM into hongbinl/activation_offloading_cuda_graph

format

ae4e2b5

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

fit ut

b797438

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

delay d2h copies until finishing cuda graph

6cec22f

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

copy-pr-bot bot temporarily deployed to test March 2, 2026 07:24 Inactive

add flag to control flush_delayed_groups in fine_grained_callables.py

0200121

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

copy-pr-bot bot temporarily deployed to test March 2, 2026 09:27 Inactive

buptzyb reviewed Mar 4, 2026


lhb8125 changed the title from "Support CUDA Graph capture offloading modules" to "[Dev][feat] Support CUDA Graph capture offloading modules" on Mar 4, 2026

jiemingz reviewed Mar 4, 2026

megatron/core/transformer/transformer_layer.py (outdated, resolved)

lhb8125 and others added 4 commits March 5, 2026 01:05

1. move backward_record() to te_cuda_graph_capture()

c8bd90d

2. remove flush_delayed_groups() when the training is not in replay mode

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

format

ddd67d2

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

remove the knob forward_only when executing reset()

e989b95

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

Merge branch 'dev' into hongbinl/activation_offloading_refactor_cuda_…

b481fa9

…graph

copy-pr-bot bot temporarily deployed to test March 5, 2026 09:07 Inactive

lhb8125 added 3 commits March 5, 2026 05:18

fix ut and reviewer's comments

19fe6b3

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

Merge branch 'hongbinl/activation_offloading_refactor_cuda_graph' of h…

cf04b4a

…ttps://github.com/lhb8125/Megatron-LM into hongbinl/activation_offloading_refactor_cuda_graph

Merge branch 'dev' into hongbinl/activation_offloading_refactor_cuda_…

ce84682

…graph

copy-pr-bot bot temporarily deployed to test March 5, 2026 13:22 Inactive

buptzyb reviewed Mar 5, 2026


jiemingz approved these changes Mar 5, 2026


claude bot reviewed Mar 6, 2026

megatron/core/pipeline_parallel/fine_grained_activation_offload.py (outdated, resolved)

claude bot reviewed Mar 6, 2026

megatron/core/pipeline_parallel/fine_grained_activation_offload.py (outdated, resolved)

claude bot reviewed Mar 6, 2026

megatron/core/pipeline_parallel/fine_grained_activation_offload.py (resolved)

claude bot reviewed Mar 6, 2026

megatron/core/transformer/multi_latent_attention.py (resolved)

buptzyb reviewed Mar 6, 2026


lhb8125 added 2 commits March 5, 2026 22:47

resolve comments

d8da479

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

remove redundant offloading doc

a6e16a9

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

copy-pr-bot bot temporarily deployed to test March 6, 2026 06:55 Inactive



4 participants
